{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "OAxxNGFDJhqk"
   },
   "source": [
    "# Export a Cluster and Ask GPT About It!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JM_SoupgJs-O"
   },
   "source": [
    "## Load ChatGPT\n",
    "\n",
    "This notebook analyzes a dataset of prompt/response pairs collected from GPT-3.5 (ChatGPT) with the OpenAI Python API. The data is loaded into Phoenix for analysis. The notebook:\n",
    "\n",
    "* Imports a dataset of previously generated prompt/response pairs\n",
    "* Loads the dataset into Phoenix for analysis\n",
    "* Exports a cluster from Phoenix for further analysis\n",
    "* Asks GPT about the cluster of data\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -qq \"openai>=1\" ipywidgets pandas 'httpx<0.28'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "conversations_df = pd.read_csv(\n",
    "    \"https://storage.googleapis.com/arize-assets/fixtures/Embeddings/GENERATIVE/dataframe_llm_gpt.csv\"\n",
    ")"
   ]
  },
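  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def string_to_array(s):\n",
    "    \"\"\"Parse a stringified vector such as '[0.1 -0.2 3]' into a float array.\"\"\"\n",
    "    # Match signed decimals first, then signed integers\n",
    "    numbers = re.findall(r\"[-+]?\\d*\\.\\d+|[-+]?\\d+\", s)\n",
    "    return np.array([float(num) for num in numbers])"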
   ]
  },
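  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because the vector columns are read from the CSV as strings, `string_to_array` pulls the numbers back out with a regular expression. The following is a minimal sketch of that regex step on made-up inputs. The sample strings are assumptions rather than rows from the dataset, and `parse_floats` is a hypothetical stand-in that returns a plain list instead of a NumPy array:\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "# Same pattern as string_to_array: a signed decimal, or a signed integer\n",
    "PATTERN = r\"[-+]?\\d*\\.\\d+|[-+]?\\d+\"\n",
    "\n",
    "\n",
    "def parse_floats(s):\n",
    "    # Hypothetical helper: like string_to_array, minus the np.array wrapper\n",
    "    return [float(num) for num in re.findall(PATTERN, s)]\n",
    "\n",
    "\n",
    "print(parse_floats(\"[ 0.12 -0.5 3 ]\"))\n",
    "# Scientific notation is split into two numbers by this pattern\n",
    "print(parse_floats(\"1.5e-05\"))\n",
    "```\n",
    "\n",
    "If the stored strings could contain scientific notation, the pattern would need an exponent part, since `1.5e-05` is otherwise read as `1.5` and `-5.0`."
   ]
  },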
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "conversations_df[\"prompt_vector\"] = conversations_df[\"prompt_vector\"].apply(string_to_array)\n",
    "conversations_df[\"response_vector\"] = conversations_df[\"response_vector\"].apply(string_to_array)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "conversations_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dugp6Krv3Thw"
   },
   "source": [
    "Install the Arize SDK to use the embedding generators available in its generators package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -qq arize"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -qq 'arize[AutoEmbeddings]'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from arize.pandas.embeddings import EmbeddingGenerator, UseCases\n",
    "\n",
    "if not all(col in conversations_df.columns for col in [\"prompt_vector\", \"response_vector\"]):\n",
    "    generator = EmbeddingGenerator.from_use_case(\n",
    "        use_case=UseCases.NLP.SEQUENCE_CLASSIFICATION,\n",
    "        model_name=\"distilbert-base-uncased\",\n",
    "        tokenizer_max_length=512,\n",
    "        batch_size=100,\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0_NWVSp83hdK"
   },
   "source": [
    "Generate embeddings for the prompt and response columns. This step only runs if the dataframe does not already contain `prompt_vector` and `response_vector`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Very fast on a GPU (seconds) but can take 2-3 minutes on a CPU\n",
    "conversations_df = conversations_df.reset_index(drop=True)\n",
    "if not all(col in conversations_df.columns for col in [\"prompt_vector\", \"response_vector\"]):\n",
    "    conversations_df[\"prompt_vector\"] = generator.generate_embeddings(\n",
    "        text_col=conversations_df[\"prompt\"]\n",
    "    )\n",
    "    conversations_df[\"response_vector\"] = generator.generate_embeddings(\n",
    "        text_col=conversations_df[\"response\"]\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6fO2154Q3sp0"
   },
   "source": [
    "**Install Phoenix**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -Uqq \"arize-phoenix[embeddings]\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import phoenix as px\n",
    "\n",
    "# Define a Schema() object for Phoenix to pick up data from the correct columns for logging\n",
    "schema = px.Schema(\n",
    "    feature_column_names=[\n",
    "        \"step\",\n",
    "        \"conversation_id\",\n",
    "        \"api_call_duration\",\n",
    "        \"response_len\",\n",
    "        \"prompt_len\",\n",
    "    ],\n",
    "    prompt_column_names=px.EmbeddingColumnNames(\n",
    "        vector_column_name=\"prompt_vector\", raw_data_column_name=\"prompt\"\n",
    "    ),\n",
    "    response_column_names=px.EmbeddingColumnNames(\n",
    "        vector_column_name=\"response_vector\", raw_data_column_name=\"response\"\n",
    "    ),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create the dataset from the conversation dataframe & schema\n",
    "conv_ds = px.Inferences(conversations_df, schema, \"production\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Click the link below to open a view of the ChatGPT data in Phoenix\n",
    "px.launch_app(conv_ds).view()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ee96QCiT44Qf"
   },
   "source": [
    "**Download a Cluster!**\n",
    "\n",
    "Click the download cluster button in Phoenix. Phoenix exports the selected cluster back to this notebook as a dataframe. Run the cells below after you click the download button."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pre_prompt = \"The following JSON contains the data points in a cluster. Can you summarize the cluster? What do the points have in common?\\n\"\n",
    "pre_prompt_baseline = \"The following JSON contains the data points in a cluster and a baseline sample from the entire dataset. Can you summarize the cluster? What do the points have in common, and how does the cluster compare to the baseline?\\n\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The most recent export is the cluster downloaded from Phoenix\n",
    "prompt_cluster_json = px.active_session().exports[-1].prompt.to_json()\n",
    "prompt_baseline_json = conversations_df.sample(n=10).prompt.to_json()\n",
    "response_cluster_json = px.active_session().exports[-1].response.to_json()\n",
    "chat_initial_input = pre_prompt + prompt_cluster_json"
   ]
  },
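  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The baseline pre-prompt and the baseline sample defined above are not used by the chat cell that follows. As a rough sketch, they could be combined with the cluster as shown here. The sample prompts, the shortened pre-prompt, and the `build_baseline_input` helper are all hypothetical and stand in for the real exported data:\n",
    "\n",
    "```python\n",
    "import json\n",
    "\n",
    "# Hypothetical stand-ins for the exported cluster and a baseline sample.\n",
    "# pandas Series.to_json() produces an index-keyed JSON object like these.\n",
    "prompt_cluster_json = json.dumps({\"0\": \"How do I reset my password?\", \"1\": \"I forgot my password\"})\n",
    "prompt_baseline_json = json.dumps({\"7\": \"What is the capital of France?\", \"12\": \"Write a haiku about rain\"})\n",
    "\n",
    "# Shortened stand-in for the notebook's pre_prompt_baseline\n",
    "pre_prompt_baseline = \"Summarize the cluster and compare it to the baseline.\\n\"\n",
    "\n",
    "\n",
    "def build_baseline_input(pre_prompt, cluster_json, baseline_json):\n",
    "    # Label each block so the model can tell the cluster from the baseline\n",
    "    return f\"{pre_prompt}Cluster:\\n{cluster_json}\\nBaseline:\\n{baseline_json}\"\n",
    "\n",
    "\n",
    "chat_baseline_input = build_baseline_input(\n",
    "    pre_prompt_baseline, prompt_cluster_json, prompt_baseline_json\n",
    ")\n",
    "print(chat_baseline_input)\n",
    "```\n",
    "\n",
    "Assigning a string built this way to `chat_initial_input` before running the chat cell would send the comparison prompt instead of the cluster-only prompt."
   ]
  },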
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make sure you have an OpenAI API key set up\n",
    "import os\n",
    "from getpass import getpass\n",
    "\n",
    "if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n",
    "    openai_api_key = getpass(\"🔑 Enter your OpenAI API key: \")\n",
    "os.environ[\"OPENAI_API_KEY\"] = openai_api_key"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# @title Chat GPT - Cluster Analysis\n",
    "import ipywidgets as widgets\n",
    "from IPython.display import display\n",
    "from openai import OpenAI\n",
    "\n",
    "client = OpenAI()\n",
    "\n",
    "messages = []\n",
    "\n",
    "# Create the output widget\n",
    "output = widgets.Output(\n",
    "    layout={\"border\": \"1px solid black\", \"width\": \"100%\", \"height\": \"300px\", \"overflow\": \"scroll\"}\n",
    ")\n",
    "\n",
    "# Create the input widget\n",
    "input_box = widgets.Textarea(\n",
    "    value=\"\",\n",
    "    placeholder=\"Type your message here...\",\n",
    "    description=\"\",\n",
    "    disabled=False,\n",
    "    layout=widgets.Layout(width=\"100%\", height=\"100px\"),\n",
    ")\n",
    "\n",
    "# Create the submit button\n",
    "submit_button = widgets.Button(\n",
    "    description=\"Send\", disabled=False, button_style=\"success\", tooltip=\"Send your message\"\n",
    ")\n",
    "\n",
    "# Display the output widget and the input components\n",
    "display(output)\n",
    "display(input_box)\n",
    "display(submit_button)\n",
    "\n",
    "# Send the exported cluster to the model as the first message\n",
    "with output:\n",
    "    message = chat_initial_input\n",
    "    if message:\n",
    "        messages.append(\n",
    "            {\"role\": \"user\", \"content\": message},\n",
    "        )\n",
    "        chat = client.chat.completions.create(model=\"gpt-3.5-turbo\", messages=messages)\n",
    "        reply = chat.choices[0].message.content\n",
    "        print(f\"ChatGPT RESPONSE: {reply}\")\n",
    "        print(\"\\n\")\n",
    "        print(\"-- Ask another question related to your data below --\")\n",
    "        messages.append({\"role\": \"assistant\", \"content\": reply})\n",
    "\n",
    "\n",
    "def process_input(input_text):\n",
    "    # Simulate a simple chatbot response (you can replace this with your own logic)\n",
    "    response = f\"You said: {input_text}\"\n",
    "    return response\n",
    "\n",
    "\n",
    "def on_submit_button_click(button):\n",
    "    with output:\n",
    "        user_input = input_box.value.strip()\n",
    "        if user_input:\n",
    "            messages.append(\n",
    "                {\"role\": \"user\", \"content\": user_input},\n",
    "            )\n",
    "            chat = client.chat.completions.create(model=\"gpt-3.5-turbo\", messages=messages)\n",
    "            reply = chat.choices[0].message.content\n",
    "            print(f\"ChatGPT RESPONSE: {reply}\")\n",
    "            messages.append({\"role\": \"assistant\", \"content\": reply})\n",
    "        input_box.value = \"\"\n",
    "\n",
    "\n",
    "# Set the button click event handler\n",
    "submit_button.on_click(on_submit_button_click)"
   ]
  },
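  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The chat cell above keeps context by appending every user message and every model reply to the same `messages` list and sending the whole list on each call. A minimal sketch of that bookkeeping, with a stub in place of the OpenAI call (`fake_completion` and `ask` are hypothetical names used only for illustration):\n",
    "\n",
    "```python\n",
    "def fake_completion(messages):\n",
    "    # Hypothetical stand-in for client.chat.completions.create\n",
    "    return f\"(reply after {len(messages)} messages)\"\n",
    "\n",
    "\n",
    "messages = []\n",
    "\n",
    "\n",
    "def ask(user_input):\n",
    "    # Record the question, get a reply for the full history, record the reply\n",
    "    messages.append({\"role\": \"user\", \"content\": user_input})\n",
    "    reply = fake_completion(messages)\n",
    "    messages.append({\"role\": \"assistant\", \"content\": reply})\n",
    "    return reply\n",
    "\n",
    "\n",
    "ask(\"Summarize the cluster.\")\n",
    "ask(\"What do the prompts have in common?\")\n",
    "print([m[\"role\"] for m in messages])\n",
    "```\n",
    "\n",
    "Because the full history is resent every time, long conversations over a large cluster can approach the model's context limit."
   ]
  },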
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "e62u0uYdAmUS"
   },
   "source": [
    "The example above is for demonstration purposes only. Application-specific integrations will look different."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
